1) Understand the concept of time series data and what does it imply
2) Process time series datasets into effective dataframes
3) Produce running bar charts using a variety of tools
Simply put, datasets that describe/contain data which varies over time
They are usually characterised by a fixed and constant aggregation of data
Readings from sensors in the industry
Coronavirus cases
Trends in economics
The art of using these datasets resides in knowing how to shape them and using the right features
To practice, we will use the dataset called bitcoin.csv
This data set contains the price of the Bitcoin cryptocurrency on the US-based exchange Gemini, minute-by-minute for all of 2017
import pandas as pd
import numpy as np
bitcoin = pd.read_csv('https://www.dropbox.com/s/kxjc92oia6gcoy8/bitcoin.csv?raw=1', index_col=0)
bitcoin
| Open | High | Low | Close | Volume | |
|---|---|---|---|---|---|
| Date | |||||
| 2017-01-01 00:00:00 | 974.55 | 974.55 | 974.55 | 974.55 | 0.000000 |
| 2017-01-01 00:01:00 | 974.55 | 974.55 | 974.55 | 974.55 | 0.000000 |
| 2017-01-01 00:02:00 | 974.55 | 974.55 | 970.00 | 970.00 | 0.417679 |
| 2017-01-01 00:03:00 | 970.00 | 970.00 | 970.00 | 970.00 | 0.000514 |
| 2017-01-01 00:04:00 | 970.00 | 970.00 | 970.00 | 970.00 | 0.000000 |
| ... | ... | ... | ... | ... | ... |
| 2017-12-31 23:55:00 | 13782.87 | 13782.87 | 13771.00 | 13771.00 | 1.657980 |
| 2017-12-31 23:56:00 | 13771.00 | 13775.00 | 13770.95 | 13775.00 | 7.538428 |
| 2017-12-31 23:57:00 | 13775.00 | 13815.37 | 13775.00 | 13815.37 | 18.437304 |
| 2017-12-31 23:58:00 | 13815.37 | 13825.00 | 13815.37 | 13825.00 | 3.547593 |
| 2017-12-31 23:59:00 | 13825.00 | 13825.00 | 13804.68 | 13820.26 | 2.610719 |
525600 rows × 5 columns
Note that the Date column contains entries formatted as a date-time such as 2017-01-01 00:00:00
First, let's check the type of this data by examining the first item
first_index_value = bitcoin.index[0]
print(first_index_value)
print(type(first_index_value))
2017-01-01 00:00:00 <class 'str'>
We can avoid this in one go by parsing the dates when loading
bitcoin = pd.read_csv('https://www.dropbox.com/s/kxjc92oia6gcoy8/bitcoin.csv?raw=1',
parse_dates=['Date'], index_col=0)
bitcoin
| Open | High | Low | Close | Volume | |
|---|---|---|---|---|---|
| Date | |||||
| 2017-01-01 00:00:00 | 974.55 | 974.55 | 974.55 | 974.55 | 0.000000 |
| 2017-01-01 00:01:00 | 974.55 | 974.55 | 974.55 | 974.55 | 0.000000 |
| 2017-01-01 00:02:00 | 974.55 | 974.55 | 970.00 | 970.00 | 0.417679 |
| 2017-01-01 00:03:00 | 970.00 | 970.00 | 970.00 | 970.00 | 0.000514 |
| 2017-01-01 00:04:00 | 970.00 | 970.00 | 970.00 | 970.00 | 0.000000 |
| ... | ... | ... | ... | ... | ... |
| 2017-12-31 23:55:00 | 13782.87 | 13782.87 | 13771.00 | 13771.00 | 1.657980 |
| 2017-12-31 23:56:00 | 13771.00 | 13775.00 | 13770.95 | 13775.00 | 7.538428 |
| 2017-12-31 23:57:00 | 13775.00 | 13815.37 | 13775.00 | 13815.37 | 18.437304 |
| 2017-12-31 23:58:00 | 13815.37 | 13825.00 | 13815.37 | 13825.00 | 3.547593 |
| 2017-12-31 23:59:00 | 13825.00 | 13825.00 | 13804.68 | 13820.26 | 2.610719 |
525600 rows × 5 columns
If we check the column now, it is marked as Timestamp
first_index_value = bitcoin.index[0]
print(first_index_value)
print(type(first_index_value))
2017-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
This data set contains the following columns:
Price Data:
Trade Data:
Other time series data sets with have other columns, but this is common for financial data.
Even though the index is stored as Timestamp, we can retrieve rows by querying with a string
For example during the minute 2017-05-27 19:30:00, we can query using 2017-05-27 19:30:00 which matches the data
However, since the data set has only minute-by-minute data, the seconds don't matter, so we can exclude the seconds in our query and search for 2017-05-27 19:30
bitcoin.loc['2017-05-27 19:30']
Open 2144.68000 High 2144.68000 Low 2141.26000 Close 2142.53000 Volume 0.05406 Name: 2017-05-27 19:30:00, dtype: float64
The price started this minute at 2144.68 USD, dropped to 2142.53 USD at the end of the minute
During this time the minimum value was 2141.26 USD and the maximum value was 2144.68 USD
0.05406 bitcoin was bought/sold during this minute
Let's check the next minute 2017-05-27 19:31
bitcoin.loc['2017-05-27 19:31']
Open 2142.5300 High 2142.5300 Low 2141.2700 Close 2141.2700 Volume 0.2069 Name: 2017-05-27 19:31:00, dtype: float64
The previous minute closed at 2142.53 USD and this minute opened at 2142.53 USD
This is expected since the end of one minute is the same as the start of the next minute!
If we want to check the open price for the day and close price for the day, we could check the open price for the first minute and close price for the last minute
print('Date: 2017-05-27')
print('Open:', bitcoin.at['2017-05-27 00:00', 'Open'])
print('Close:', bitcoin.at['2017-05-27 23:59', 'Close'])
Date: 2017-05-27 Open: 2267.37 Close: 2096.77
We can select an interval of time the same way we do with numbers
With the following code, we select all of the values in May of 2017
bitcoin[(bitcoin.index >= '2017-05') & (bitcoin.index < '2017-06')]
| Open | High | Low | Close | Volume | |
|---|---|---|---|---|---|
| Date | |||||
| 2017-05-01 00:00:00 | 1362.63 | 1362.63 | 1359.05 | 1359.05 | 0.129530 |
| 2017-05-01 00:01:00 | 1359.05 | 1359.05 | 1359.05 | 1359.05 | 0.015945 |
| 2017-05-01 00:02:00 | 1359.05 | 1359.05 | 1359.05 | 1359.05 | 0.430274 |
| 2017-05-01 00:03:00 | 1359.05 | 1359.05 | 1359.05 | 1359.05 | 0.024931 |
| 2017-05-01 00:04:00 | 1359.05 | 1359.05 | 1359.05 | 1359.05 | 0.000000 |
| ... | ... | ... | ... | ... | ... |
| 2017-05-31 23:55:00 | 2301.80 | 2301.80 | 2301.80 | 2301.80 | 0.000000 |
| 2017-05-31 23:56:00 | 2301.80 | 2301.80 | 2301.80 | 2301.80 | 0.065392 |
| 2017-05-31 23:57:00 | 2301.80 | 2301.80 | 2301.80 | 2301.80 | 0.000000 |
| 2017-05-31 23:58:00 | 2301.80 | 2301.80 | 2301.80 | 2301.80 | 0.000000 |
| 2017-05-31 23:59:00 | 2301.80 | 2303.62 | 2301.80 | 2303.62 | 13.587863 |
44640 rows × 5 columns
Selecting all of the values for 1st May before noon
bitcoin[(bitcoin.index >= '2017-05-01') & (bitcoin.index < '2017-05-01 12:00')]
| Open | High | Low | Close | Volume | |
|---|---|---|---|---|---|
| Date | |||||
| 2017-05-01 00:00:00 | 1362.63 | 1362.63 | 1359.05 | 1359.05 | 0.129530 |
| 2017-05-01 00:01:00 | 1359.05 | 1359.05 | 1359.05 | 1359.05 | 0.015945 |
| 2017-05-01 00:02:00 | 1359.05 | 1359.05 | 1359.05 | 1359.05 | 0.430274 |
| 2017-05-01 00:03:00 | 1359.05 | 1359.05 | 1359.05 | 1359.05 | 0.024931 |
| 2017-05-01 00:04:00 | 1359.05 | 1359.05 | 1359.05 | 1359.05 | 0.000000 |
| ... | ... | ... | ... | ... | ... |
| 2017-05-01 11:55:00 | 1376.43 | 1376.43 | 1376.40 | 1376.43 | 0.064194 |
| 2017-05-01 11:56:00 | 1376.43 | 1377.36 | 1376.43 | 1377.36 | 0.013068 |
| 2017-05-01 11:57:00 | 1377.36 | 1377.36 | 1377.35 | 1377.35 | 0.023829 |
| 2017-05-01 11:58:00 | 1377.35 | 1378.25 | 1377.35 | 1378.24 | 17.260743 |
| 2017-05-01 11:59:00 | 1378.24 | 1378.24 | 1378.24 | 1378.24 | 0.000000 |
720 rows × 5 columns
We may want to calculate other values such as taking the middle value between the Low and High values, this will allow us to calculate an estimate for the value of the bitcoin traded
It will only be an estimate. since we don't have the exact list of trades available in this dataset!
We can perform calculations on the columns in exactly the same we we have with any other data set before
bitcoin['Middle'] = (bitcoin['High'] + bitcoin['Low']) / 2
bitcoin['USD Volume'] = bitcoin['Middle'] * bitcoin['Volume']
bitcoin
| Open | High | Low | Close | Volume | Middle | USD Volume | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2017-01-01 00:00:00 | 974.55 | 974.55 | 974.55 | 974.55 | 0.000000 | 974.550 | 0.000000 |
| 2017-01-01 00:01:00 | 974.55 | 974.55 | 974.55 | 974.55 | 0.000000 | 974.550 | 0.000000 |
| 2017-01-01 00:02:00 | 974.55 | 974.55 | 970.00 | 970.00 | 0.417679 | 972.275 | 406.099132 |
| 2017-01-01 00:03:00 | 970.00 | 970.00 | 970.00 | 970.00 | 0.000514 | 970.000 | 0.498755 |
| 2017-01-01 00:04:00 | 970.00 | 970.00 | 970.00 | 970.00 | 0.000000 | 970.000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-12-31 23:55:00 | 13782.87 | 13782.87 | 13771.00 | 13771.00 | 1.657980 | 13776.935 | 22841.878972 |
| 2017-12-31 23:56:00 | 13771.00 | 13775.00 | 13770.95 | 13775.00 | 7.538428 | 13772.975 | 103826.583413 |
| 2017-12-31 23:57:00 | 13775.00 | 13815.37 | 13775.00 | 13815.37 | 18.437304 | 13795.185 | 254346.017926 |
| 2017-12-31 23:58:00 | 13815.37 | 13825.00 | 13815.37 | 13825.00 | 3.547593 | 13820.185 | 49028.397646 |
| 2017-12-31 23:59:00 | 13825.00 | 13825.00 | 13804.68 | 13820.26 | 2.610719 | 13814.840 | 36066.660297 |
525600 rows × 7 columns
Sometimes we are interested in sampling the in larger chunks than it is presented
For instance, this data has one line per minute; we may want data for days instead
The method to resample the data in Python is called resample, and takes a time period as an argument
For example, 1D means 1 day (you can see more options here)
# Not very useful output right?
bitcoin.resample('1D')
<pandas.core.resample.DatetimeIndexResampler object at 0x00000185A60DD6A0>
If we call the mean method, it will average out the values of this new object by day
bitcoin.resample('1D').mean()
| Open | High | Low | Close | Volume | Middle | USD Volume | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2017-01-01 | 980.776378 | 980.898135 | 980.663396 | 980.793059 | 0.743065 | 980.780766 | 737.218302 |
| 2017-01-02 | 1016.424132 | 1016.618785 | 1016.247319 | 1016.439174 | 1.006385 | 1016.433052 | 1025.061363 |
| 2017-01-03 | 1023.624340 | 1023.743431 | 1023.514653 | 1023.634618 | 1.596420 | 1023.629042 | 1636.278683 |
| 2017-01-04 | 1082.933979 | 1083.422104 | 1082.550431 | 1083.009069 | 2.258822 | 1082.986267 | 2464.408073 |
| 2017-01-05 | 1056.542243 | 1057.577083 | 1055.411271 | 1056.446188 | 4.190673 | 1056.494177 | 4165.407749 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-12-27 | 15639.538896 | 15655.889035 | 15621.917583 | 15639.252847 | 6.427652 | 15638.903309 | 99412.047834 |
| 2017-12-28 | 14261.058215 | 14285.193507 | 14237.371021 | 14260.412146 | 7.639668 | 14261.282264 | 108742.624129 |
| 2017-12-29 | 14535.871354 | 14549.550931 | 14522.446875 | 14535.931340 | 4.438627 | 14535.998903 | 64443.408129 |
| 2017-12-30 | 13476.331806 | 13494.028972 | 13454.185951 | 13475.012840 | 6.065731 | 13474.107462 | 80474.783127 |
| 2017-12-31 | 13274.262319 | 13285.760785 | 13262.565792 | 13275.057167 | 5.678043 | 13274.163288 | 75170.569715 |
365 rows × 7 columns
The method takes the values in the column Volume and rows 2017-01-01 00:00 to 2017-01-01 23:59 and averaging them
We can also get the total by adding the values
bitcoin.resample('1D').sum()
| Open | High | Low | Close | Volume | Middle | USD Volume | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2017-01-01 | 1.412318e+06 | 1.412493e+06 | 1412155.29 | 1.412342e+06 | 1070.013119 | 1.412324e+06 | 1.061594e+06 |
| 2017-01-02 | 1.463651e+06 | 1.463931e+06 | 1463396.14 | 1.463672e+06 | 1449.193752 | 1.463664e+06 | 1.476088e+06 |
| 2017-01-03 | 1.474019e+06 | 1.474191e+06 | 1473861.10 | 1.474034e+06 | 2298.844903 | 1.474026e+06 | 2.356241e+06 |
| 2017-01-04 | 1.559425e+06 | 1.560128e+06 | 1558872.62 | 1.559533e+06 | 3252.704150 | 1.559500e+06 | 3.548748e+06 |
| 2017-01-05 | 1.521421e+06 | 1.522911e+06 | 1519792.23 | 1.521283e+06 | 6034.569411 | 1.521352e+06 | 5.998187e+06 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-12-27 | 2.252094e+07 | 2.254448e+07 | 22495561.32 | 2.252052e+07 | 9255.819566 | 2.252002e+07 | 1.431533e+08 |
| 2017-12-28 | 2.053592e+07 | 2.057068e+07 | 20501814.27 | 2.053499e+07 | 11001.121472 | 2.053625e+07 | 1.565894e+08 |
| 2017-12-29 | 2.093165e+07 | 2.095135e+07 | 20912323.50 | 2.093174e+07 | 6391.623337 | 2.093184e+07 | 9.279851e+07 |
| 2017-12-30 | 1.940592e+07 | 1.943140e+07 | 19374027.77 | 1.940402e+07 | 8734.652933 | 1.940271e+07 | 1.158837e+08 |
| 2017-12-31 | 1.911494e+07 | 1.913150e+07 | 19098094.74 | 1.911608e+07 | 8176.382068 | 1.911480e+07 | 1.082456e+08 |
365 rows × 7 columns
This operation was more useful for the Volume column, as it now sums up all of the entries for that day
We can see that 1070.013119 bitcoin (about $1,061,594 at the time) were traded on 01/01
In fact there was another way to get this!
Multipying the bitcoin per minute rate from before (0.743065) times the number of minute in a day (1440)
1440*0.743065
1070.0136
If we want the lowest or highest price for the whole day, we don't need an average or a sum!
We can tell Pandas to use max on these values, and also we can use min for Low
bitcoin.resample('1D').max()
| Open | High | Low | Close | Volume | Middle | USD Volume | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2017-01-01 | 1003.54 | 1003.54 | 1003.54 | 1003.54 | 110.126689 | 1003.540 | 1.102911e+05 |
| 2017-01-02 | 1035.89 | 1035.90 | 1035.88 | 1035.89 | 116.627861 | 1035.885 | 1.197103e+05 |
| 2017-01-03 | 1038.26 | 1038.26 | 1038.26 | 1038.26 | 405.000039 | 1038.260 | 4.150481e+05 |
| 2017-01-04 | 1150.00 | 1155.54 | 1149.97 | 1150.00 | 167.000000 | 1149.980 | 1.839179e+05 |
| 2017-01-05 | 1167.16 | 1167.16 | 1167.16 | 1167.16 | 918.000000 | 1167.160 | 8.822990e+05 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-12-27 | 16500.00 | 16500.00 | 16499.58 | 16500.00 | 115.322939 | 16499.790 | 1.704437e+06 |
| 2017-12-28 | 15495.96 | 15499.99 | 15493.84 | 15495.96 | 143.885194 | 15494.420 | 2.005490e+06 |
| 2017-12-29 | 15077.92 | 15115.10 | 15060.75 | 15077.92 | 172.666517 | 15087.925 | 2.448388e+06 |
| 2017-12-30 | 14607.75 | 14607.77 | 14577.60 | 14607.75 | 168.027243 | 14592.685 | 2.205184e+06 |
| 2017-12-31 | 14180.00 | 14191.20 | 14163.09 | 14180.00 | 106.797334 | 14171.545 | 1.435948e+06 |
365 rows × 7 columns
bitcoin.resample('1D').min()
| Open | High | Low | Close | Volume | Middle | USD Volume | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2017-01-01 | 965.00 | 965.00 | 965.00 | 965.00 | 0.0 | 965.000 | 0.0 |
| 2017-01-02 | 995.00 | 995.01 | 995.00 | 995.00 | 0.0 | 995.010 | 0.0 |
| 2017-01-03 | 1014.34 | 1014.35 | 1014.06 | 1014.34 | 0.0 | 1014.345 | 0.0 |
| 2017-01-04 | 1035.03 | 1035.03 | 1035.02 | 1035.03 | 0.0 | 1035.030 | 0.0 |
| 2017-01-05 | 879.00 | 899.87 | 879.00 | 879.00 | 0.0 | 892.365 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-12-27 | 14541.60 | 14599.62 | 14521.00 | 14541.60 | 0.0 | 14560.500 | 0.0 |
| 2017-12-28 | 13532.44 | 13579.80 | 13514.81 | 13532.44 | 0.0 | 13556.120 | 0.0 |
| 2017-12-29 | 13962.47 | 14000.02 | 13949.50 | 13962.47 | 0.0 | 13983.010 | 0.0 |
| 2017-12-30 | 12481.44 | 12508.19 | 12481.44 | 12481.44 | 0.0 | 12504.095 | 0.0 |
| 2017-12-31 | 12540.25 | 12555.11 | 12511.00 | 12540.25 | 0.0 | 12540.055 | 0.0 |
365 rows × 7 columns
To generate a new dataframe with both changes applied to the respective column in one line of code, we will use the resample, lambda and apply methods:
For High we will max of the values of High for each minute.
For Low we will use min.
For Volume and USD Volume we will use sum.
For Open we will use the first item (index [0]).
For Close we will use the last item (index [-1]).
daily = bitcoin.resample('1D').apply({'Open' : lambda x: x[0],
'High' : np.max,
'Low' : np.min,
'Close' : lambda x: x[-1],
'Volume' : np.sum,
'USD Volume' : np.sum})
daily
| Open | High | Low | Close | Volume | USD Volume | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2017-01-01 | 974.55 | 1003.54 | 965.00 | 998.57 | 1070.013119 | 1.061594e+06 |
| 2017-01-02 | 998.57 | 1035.90 | 995.00 | 1020.23 | 1449.193752 | 1.476088e+06 |
| 2017-01-03 | 1020.23 | 1038.26 | 1014.06 | 1035.03 | 2298.844903 | 2.356241e+06 |
| 2017-01-04 | 1035.03 | 1155.54 | 1035.02 | 1143.16 | 3252.704150 | 3.548748e+06 |
| 2017-01-05 | 1143.16 | 1167.16 | 879.00 | 1004.84 | 6034.569411 | 5.998187e+06 |
| ... | ... | ... | ... | ... | ... | ... |
| 2017-12-27 | 15830.86 | 16500.00 | 14521.00 | 15418.95 | 9255.819566 | 1.431533e+08 |
| 2017-12-28 | 15418.95 | 15499.99 | 13514.81 | 14488.61 | 11001.121472 | 1.565894e+08 |
| 2017-12-29 | 14488.61 | 15115.10 | 13949.50 | 14574.99 | 6391.623337 | 9.279851e+07 |
| 2017-12-30 | 14574.99 | 14607.77 | 12481.44 | 12675.68 | 8734.652933 | 1.158837e+08 |
| 2017-12-31 | 12675.68 | 14191.20 | 12511.00 | 13820.26 | 8176.382068 | 1.082456e+08 |
365 rows × 6 columns
Notice that the Close of one day still matches the Open of the next
Moreover, we have deliberately left out Middle since we want to recalculate it on a daily basis, as it will be different from what we calculate on a minute basis!
We will leave the USD Volume based on minute data since that will give a more accurate estimate and this column is not exact since we don't have the trades list
Let's recalculate Middle on this new daily data
daily['Middle'] = (daily['High'] + daily['Low']) / 2
daily
| Open | High | Low | Close | Volume | USD Volume | Middle | |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2017-01-01 | 974.55 | 1003.54 | 965.00 | 998.57 | 1070.013119 | 1.061594e+06 | 984.270 |
| 2017-01-02 | 998.57 | 1035.90 | 995.00 | 1020.23 | 1449.193752 | 1.476088e+06 | 1015.450 |
| 2017-01-03 | 1020.23 | 1038.26 | 1014.06 | 1035.03 | 2298.844903 | 2.356241e+06 | 1026.160 |
| 2017-01-04 | 1035.03 | 1155.54 | 1035.02 | 1143.16 | 3252.704150 | 3.548748e+06 | 1095.280 |
| 2017-01-05 | 1143.16 | 1167.16 | 879.00 | 1004.84 | 6034.569411 | 5.998187e+06 | 1023.080 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-12-27 | 15830.86 | 16500.00 | 14521.00 | 15418.95 | 9255.819566 | 1.431533e+08 | 15510.500 |
| 2017-12-28 | 15418.95 | 15499.99 | 13514.81 | 14488.61 | 11001.121472 | 1.565894e+08 | 14507.400 |
| 2017-12-29 | 14488.61 | 15115.10 | 13949.50 | 14574.99 | 6391.623337 | 9.279851e+07 | 14532.300 |
| 2017-12-30 | 14574.99 | 14607.77 | 12481.44 | 12675.68 | 8734.652933 | 1.158837e+08 | 13544.605 |
| 2017-12-31 | 12675.68 | 14191.20 | 12511.00 | 13820.26 | 8176.382068 | 1.082456e+08 | 13351.100 |
365 rows × 7 columns
Typically, the first plot used for this type of data is a line plot
import plotly.express as px
fig = px.line(daily.reset_index(), x='Date', y='Middle')
fig.show()
However, if we had the data of one more cryptocurrency (yes, there's more than bitcoin!), we could compare their trends and see which one is trading higher!
And that is why people came up with... RUNNING BAR CHARTS (RBC)!

Let's see how this example was produced using Python
We will use a file called city_populations.txt which you can download from here
This file contains population data of different world cities through time
df = pd.read_csv('https://gist.githubusercontent.com/johnburnmurdoch/4199dbe55095c3e13de8d5b2e5e5307a/raw/fa018b25c24b7b5f47fd0568937ff6c04e384786/city_populations')
df
| name | group | year | value | subGroup | city_id | lastValue | lat | lon | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Agra | India | 1575 | 200.0 | India | Agra - India | 200.0 | 27.18333 | 78.01667 |
| 1 | Agra | India | 1576 | 212.0 | India | Agra - India | 200.0 | 27.18333 | 78.01667 |
| 2 | Agra | India | 1577 | 224.0 | India | Agra - India | 212.0 | 27.18333 | 78.01667 |
| 3 | Agra | India | 1578 | 236.0 | India | Agra - India | 224.0 | 27.18333 | 78.01667 |
| 4 | Agra | India | 1579 | 248.0 | India | Agra - India | 236.0 | 27.18333 | 78.01667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6247 | Vijayanagar | India | 1561 | 480.0 | India | Vijayanagar - India | 480.0 | 15.33500 | 76.46200 |
| 6248 | Vijayanagar | India | 1562 | 480.0 | India | Vijayanagar - India | 480.0 | 15.33500 | 76.46200 |
| 6249 | Vijayanagar | India | 1563 | 480.0 | India | Vijayanagar - India | 480.0 | 15.33500 | 76.46200 |
| 6250 | Vijayanagar | India | 1564 | 480.0 | India | Vijayanagar - India | 480.0 | 15.33500 | 76.46200 |
| 6251 | Vijayanagar | India | 1565 | 480.0 | India | Vijayanagar - India | 480.0 | 15.33500 | 76.46200 |
6252 rows × 9 columns
Let's clean this a little bit...
# These are the only columns I need
df = df[['name','group','year','value']]
df
| name | group | year | value | |
|---|---|---|---|---|
| 0 | Agra | India | 1575 | 200.0 |
| 1 | Agra | India | 1576 | 212.0 |
| 2 | Agra | India | 1577 | 224.0 |
| 3 | Agra | India | 1578 | 236.0 |
| 4 | Agra | India | 1579 | 248.0 |
| ... | ... | ... | ... | ... |
| 6247 | Vijayanagar | India | 1561 | 480.0 |
| 6248 | Vijayanagar | India | 1562 | 480.0 |
| 6249 | Vijayanagar | India | 1563 | 480.0 |
| 6250 | Vijayanagar | India | 1564 | 480.0 |
| 6251 | Vijayanagar | India | 1565 | 480.0 |
6252 rows × 4 columns
We will change the names of the columns to something more coherent!
df = df.rename(columns={"name": "City", "group": "Continent","year": "Year", "value":"Population"})
df
| City | Continent | Year | Population | |
|---|---|---|---|---|
| 0 | Agra | India | 1575 | 200.0 |
| 1 | Agra | India | 1576 | 212.0 |
| 2 | Agra | India | 1577 | 224.0 |
| 3 | Agra | India | 1578 | 236.0 |
| 4 | Agra | India | 1579 | 248.0 |
| ... | ... | ... | ... | ... |
| 6247 | Vijayanagar | India | 1561 | 480.0 |
| 6248 | Vijayanagar | India | 1562 | 480.0 |
| 6249 | Vijayanagar | India | 1563 | 480.0 |
| 6250 | Vijayanagar | India | 1564 | 480.0 |
| 6251 | Vijayanagar | India | 1565 | 480.0 |
6252 rows × 4 columns
Notice that in the Continent column we have India for some cities
If I'm correct, India is a sub-continent rather than a continent!
Let's see the continents listed here...
set(df['Continent'])
{'Asia', 'Europe', 'India', 'Latin America', 'Middle East', 'North America'}
Middle East and Latin America are also listed as continents!
# Cities in the "Indian continent"
print(set(df[df['Continent']=='India']['City']))
{'Vijayanagar', 'Lucknow', 'Mumbai', 'Ahmedabad', 'Bijapur', 'Kolkatta', 'Cuttack', 'Gauda', 'Delhi', 'Agra'}
# Cities with "Middle East" continent
print(set(df[df['Continent']=='Middle East']['City']))
{'Tabriz', 'Esfahan', 'Mashhad', 'Cairo', 'Fez'}
# Cities with "Middle East" continent
print(set(df[df['Continent']=='Latin America']['City']))
{'Rio de Janeiro', 'Mexico City', 'Sao Paulo', 'Buenos Aires'}
Let's first replace India -> Asia
df['Continent'] = df['Continent'].replace('India', 'Asia')
With the other two is slightly different, as these categories contain cities from different continents!
In fact I may need to change "one by one"
df['Continent'][df['City']=='Fez']='Africa'
df['Continent'][df['City']=='Cairo']='Africa'
df['Continent'] = df['Continent'].replace('Middle East', 'Asia')
df['Continent'][df['City']=='Mexico City']='North America'
df['Continent'] = df['Continent'].replace('Latin America', 'South America')
df
C:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Anaconda\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Anaconda\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| City | Continent | Year | Population | |
|---|---|---|---|---|
| 0 | Agra | Asia | 1575 | 200.0 |
| 1 | Agra | Asia | 1576 | 212.0 |
| 2 | Agra | Asia | 1577 | 224.0 |
| 3 | Agra | Asia | 1578 | 236.0 |
| 4 | Agra | Asia | 1579 | 248.0 |
| ... | ... | ... | ... | ... |
| 6247 | Vijayanagar | Asia | 1561 | 480.0 |
| 6248 | Vijayanagar | Asia | 1562 | 480.0 |
| 6249 | Vijayanagar | Asia | 1563 | 480.0 |
| 6250 | Vijayanagar | Asia | 1564 | 480.0 |
| 6251 | Vijayanagar | Asia | 1565 | 480.0 |
6252 rows × 4 columns
It shows a warning, but it works!
# Show that the relacement has been successful
df[df['City']=='Mexico City']
| City | Continent | Year | Population | |
|---|---|---|---|---|
| 3508 | Mexico City | North America | 1952 | 3736.6 |
| 3509 | Mexico City | North America | 1953 | 3922.4 |
| 3510 | Mexico City | North America | 1954 | 4108.2 |
| 3511 | Mexico City | North America | 1955 | 4294.0 |
| 3512 | Mexico City | North America | 1956 | 4531.0 |
| ... | ... | ... | ... | ... |
| 3572 | Mexico City | North America | 2016 | 21172.8 |
| 3573 | Mexico City | North America | 2017 | 21346.6 |
| 3574 | Mexico City | North America | 2018 | 21520.4 |
| 3575 | Mexico City | North America | 2019 | 21694.2 |
| 3576 | Mexico City | North America | 2020 | 21868.0 |
69 rows × 4 columns
I will save the new dataset locally
df.to_csv(r'city_populations_updated.csv',index=False)
Now, let's try to do our own RBC with Python and the matplotlib package
# Importing the necessary packages
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
If we just plot city vs population using ax.barh, we will see the last value of population for each of the cities
fig, ax = plt.subplots(figsize=(15, 8))
ax.barh(df['City'], df['Population'])
<BarContainer object of 6252 artists>
To do a running bar chart, we need to define a given year and sort the cities from most to least populated (for that year)
current_year = 1500
df2 = (df[df['Year'].eq(current_year)]
.sort_values(by='Population',ascending=False))
df2
| City | Continent | Year | Population | |
|---|---|---|---|---|
| 276 | Beijing | Asia | 1500 | 672.0 |
| 6186 | Vijayanagar | Asia | 1500 | 500.0 |
| 1055 | Cairo | Africa | 1500 | 400.0 |
| 1791 | Hangzhou | Asia | 1500 | 250.0 |
| 5595 | Tabriz | Asia | 1500 | 250.0 |
| 1503 | Gauda | Asia | 1500 | 200.0 |
| 2142 | Istanbul | Europe | 1500 | 200.0 |
| 4682 | Paris | Europe | 1500 | 185.0 |
| 1537 | Guangzhou | Asia | 1500 | 150.0 |
| 3751 | Nanjing | Asia | 1500 | 147.0 |
| 1289 | Cuttack | Asia | 1500 | 140.0 |
| 1496 | Fez | Africa | 1500 | 130.0 |
Notice that in this dataset, we don't have all cities for each year so we would get different charts
# Plotting a basic bar chart
fig, ax = plt.subplots(figsize=(15, 8))
ax.barh(df2['City'], df2['Population'])
<BarContainer object of 12 artists>
To add colours and labels (for each continent), we will use a dictionary that pairs cities with their respective continent
colours = dict(zip(
['Europe', 'Asia', 'South America', 'North America', 'Africa'],
['#adb0ff', '#90d595','#aafbff', '#f7bb5f', '#eafb50']))
group_lk = df.set_index('City')['Continent'].to_dict()
group_lk
{'Agra': 'Asia',
'Ahmedabad': 'Asia',
'Amsterdam': 'Europe',
'Ayutthaya': 'Asia',
'Beijing': 'Asia',
'Berlin': 'Europe',
'Bijapur': 'Asia',
'Birmingham': 'Europe',
'Boston': 'North America',
'Buenos Aires': 'South America',
'Cairo': 'Africa',
'Chicago': 'North America',
'Cuttack': 'Asia',
'Delhi': 'Asia',
'Dhaka': 'Asia',
'Edirne': 'Europe',
'Esfahan': 'Asia',
'Fez': 'Africa',
'Gauda': 'Asia',
'Guangzhou': 'Asia',
'Hangzhou': 'Asia',
'Istanbul': 'Europe',
'Karachi': 'Asia',
'Kolkatta': 'Asia',
'Kyoto': 'Asia',
'Lisbon': 'Europe',
'London': 'Europe',
'Los Angeles': 'North America',
'Lucknow': 'Asia',
'Manchester': 'Europe',
'Mashhad': 'Asia',
'Mexico City': 'North America',
'Moscow': 'Europe',
'Mumbai': 'Asia',
'Nanjing': 'Asia',
'Naples': 'Europe',
'New York': 'North America',
'Osaka': 'Asia',
'Paris': 'Europe',
'Pegu': 'Asia',
'Philadelphia': 'North America',
'Rio de Janeiro': 'South America',
'Saint Petersburg': 'Europe',
'Sao Paulo': 'South America',
'Seoul': 'Asia',
'Shanghai': 'Asia',
'Suzhou': 'Asia',
'Tabriz': 'Asia',
'Tokyo': 'Asia',
'Venice': 'Europe',
'Vienna': 'Europe',
'Vijayanagar': 'Asia'}
Now we can use this pairing to get some colours in df2
fig, ax = plt.subplots(figsize=(15, 8))
df2 = df2[::-1] # flip values from top to bottom
# pass colours values to `color=`
ax.barh(df2['City'], df2['Population'], color=[colours[group_lk[x]] for x in df2['City']])
# iterate over the values to plot labels and values (Tokyo, Asia, 38194.2)
for i, (value, name) in enumerate(zip(df2['Population'], df2['City'])):
ax.text(value, i, name, ha='right')
ax.text(value, i-.25, group_lk[name], ha='right')
ax.text(value, i, value, ha='left')
# Add year right middle portion of canvas
ax.text(1, 0.4, current_year, transform=ax.transAxes, size=46, ha='right')
Text(1, 0.4, '1500')
This function will draw a more "stylish" function to create the last chart
# Frankly, I don't know what half of it does...
def draw_barchart(year):
dff = df[df['Year'].eq(year)].sort_values(by='Population', ascending=True).tail(10)
ax.clear()
ax.barh(dff['City'], dff['Population'], color=[colours[group_lk[x]] for x in dff['City']])
dx = dff['Population'].max() / 200
for i, (value, name) in enumerate(zip(dff['Population'], dff['City'])):
ax.text(value-dx, i, name, size=14, weight=600, ha='right', va='bottom')
ax.text(value-dx, i-.25, group_lk[name], size=10, color='#444444', ha='right', va='baseline')
ax.text(value+dx, i, f'{value:,.0f}', size=14, ha='left', va='center')
# ... polished styles
ax.text(1, 0.4, year, transform=ax.transAxes, color='#777777', size=46, ha='right', weight=800)
ax.text(0, 1.06, 'Population (thousands)', transform=ax.transAxes, size=12, color='#777777')
ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
ax.xaxis.set_ticks_position('top')
ax.tick_params(axis='x', colors='#777777', labelsize=12)
ax.set_yticks([])
ax.margins(0, 0.01)
ax.grid(which='major', axis='x', linestyle='-')
ax.set_axisbelow(True)
ax.text(0, 1.12, 'The most populous cities in the world ',
transform=ax.transAxes, size=24, weight=600, ha='left')
plt.box(False)
# Now we execute the function
fig, ax = plt.subplots(figsize=(15, 8))
draw_barchart(1988)
The most important part is to animate the plot for a range of years
import matplotlib.animation as animation
from IPython.display import HTML
fig, ax = plt.subplots(figsize=(15, 8))
animator = animation.FuncAnimation(fig, draw_barchart, frames=range(2000,2020))
HTML(animator.to_jshtml())
Besides the large amount of code, one of the biggest letdown of this is the lack of animation for the bars to "pass" each other!
Therefore,...
There are two parts, one for TSD and another for RBC
The first one consists on following the instructions to treat TSD data
The second one has the code to do the RBC
For that one, I want you to get creative!
Alternatively you can download the dataset from Moodle and see if other tools are able to these things!